The dataset analyzed here comes from PISA, a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Unlike many conventional tests, which evaluate how well students have learned the school curriculum, PISA focuses on how well prepared students are for life beyond school, assessing the practical, real-life scenarios a student is likely to face after leaving school.
Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
df_pisa = pd.read_csv('pisa2012.csv', encoding='latin-1')
print(df_pisa.shape)
df_pisa.head()
df_pisa[df_pisa.duplicated(['STIDSTD'], keep=False)]
df_pisa['STIDSTD'].value_counts()
df_pisa[df_pisa['STIDSTD'] == 135].head(10)
pd.Series(df_pisa['STIDSTD']).is_unique
- PV1MATH to PV5MATH for mathematical literacy
- PV1SCIE to PV5SCIE for scientific literacy
- PV1READ to PV5READ for reading literacy
Next, I create three variables, each the average of the five plausible values (PV1 to PV5) recorded for mathematical, scientific and reading literacy respectively. The three new variables are called:
- 'Avg_Math_Literacy'
- 'Avg_Scientific_Literacy'
- 'Avg_Reading_Literacy'
df_pisa[['PV1MATH','PV2MATH','PV3MATH', 'PV4MATH', 'PV5MATH']].describe()
df_pisa['Avg_Math_Literacy'] = df_pisa[['PV1MATH','PV2MATH','PV3MATH', 'PV4MATH', 'PV5MATH']].mean(axis=1)
df_pisa[['PV1SCIE','PV2SCIE','PV3SCIE','PV4SCIE','PV5SCIE']].describe()
df_pisa['Avg_Scientific_Literacy'] = df_pisa[['PV1SCIE','PV2SCIE','PV3SCIE','PV4SCIE','PV5SCIE']].mean(axis=1)
df_pisa[['PV1READ','PV2READ','PV3READ', 'PV4READ', 'PV5READ']].describe()
df_pisa['Avg_Reading_Literacy'] = df_pisa[['PV1READ','PV2READ','PV3READ', 'PV4READ', 'PV5READ']].mean(axis=1)
df_pisa[['Avg_Math_Literacy',
'Avg_Scientific_Literacy',
'Avg_Reading_Literacy']].head()
Finally, I create one last variable based on the three newly created variables. It is called 'Overall_Literacy' and is the average of the three previously created variables.
df_pisa['Overall_Literacy'] = df_pisa[['Avg_Math_Literacy',
'Avg_Scientific_Literacy',
'Avg_Reading_Literacy']].mean(axis=1)
df_pisa[['Avg_Math_Literacy',
'Avg_Scientific_Literacy',
'Avg_Reading_Literacy','Overall_Literacy']].head()
Another variable is to be created using the following three variables:
- SMINS - Learning time (minutes per week) - Science
- MMINS - Learning time (minutes per week) - Maths
- LMINS - Learning time (minutes per week) - Language
The variable will be called 'Average_Learning_Time' and it will be the average of the 3 variables mentioned above.
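One detail of the row-wise average is worth spelling out: by default, pandas' `mean` skips missing values, so a student with only some of the three learning times recorded still receives an average based on the values that are present. The sketch below illustrates this on a small made-up frame; `toy` and the `n_inputs` column are assumptions for illustration and not part of the PISA data.

```python
import numpy as np
import pandas as pd

# Toy learning times in minutes per week; all values are invented
toy = pd.DataFrame({
    'SMINS': [180.0, np.nan, 90.0],
    'MMINS': [240.0, 200.0, np.nan],
    'LMINS': [210.0, 220.0, np.nan],
})

cols = ['SMINS', 'MMINS', 'LMINS']

# Row-wise mean; NaN entries are skipped (skipna=True is the pandas default)
toy['Average_Learning_Time'] = toy[cols].mean(axis=1)

# Hypothetical helper column: how many non-missing inputs each average uses
toy['n_inputs'] = toy[cols].notna().sum(axis=1)

print(toy)
```

A count column like `n_inputs` makes it easy to see which averages rest on a single subject, should that matter for a later comparison.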
df_pisa[['SMINS', 'MMINS', 'LMINS']].describe()
df_pisa['Average_Learning_Time'] = df_pisa[['SMINS', 'MMINS', 'LMINS']].mean(axis=1)
df_pisa[['SMINS', 'MMINS', 'LMINS', 'Average_Learning_Time']].sample(7)
- CNT - Country (Categorical)
- OECD - Is OECD or not (Categorical)
- PV1MATH to PV5MATH - mathematical literacy (Quantitative)
- PV1SCIE to PV5SCIE - scientific literacy (Quantitative)
- PV1READ to PV5READ - reading literacy (Quantitative)
- 'Avg_Math_Literacy', 'Avg_Scientific_Literacy' and 'Avg_Reading_Literacy' - (Quantitative)
- Overall_Literacy - Engineered Variable (Quantitative)
- ST01Q01 - Student Grade out of 100 (Quantitative)
- GRADE - Grade compared to modal grade in country (Quantitative). The relative grade index (GRADE) was computed to capture between-country variation. It indicates whether students are in the country's modal grade (value of 0) or whether they are below or above the modal grade (-x grades, +x grades). The information about the students' grade level was taken from the Student Questionnaire (ST001), whereas the modal grade was defined by the country and documented in the student tracking form.
- ST04Q01 - Gender (Categorical)
- ST93Q01 - Perseverance: Give up easily (Categorical)
- IC01Q01 to IC01Q07 - computer, internet, cellphone at home
- IC08Q08 - Obtain practical information from the Internet (Categorical)
- ST22Q01 - Acculturation: Mother Immigrant (Filter) (Categorical)
- BFMJ2 - Father SQ ISEI: higher ISEI scores indicate higher levels of occupational status. (Quantitative)
- BMMJ1 - Mother SQ ISEI: higher ISEI scores indicate higher levels of occupational status. (Quantitative)
- HISCED - Highest educational level of parents (Categorical)
- HISEI - Highest parental occupational status (Quantitative)
- FISCED - Educational level of father (ISCED) (Categorical)
- PARED - Highest parental education in years (Quantitative)
- REPEAT - Grade Repetition (class repeated) (Categorical)
- SMINS - Learning time (minutes per week) - Science (Quantitative)
- MMINS - Learning time (minutes per week) - Maths (Quantitative)
- LMINS - Learning time (minutes per week) - Language (Quantitative)
- Average_Learning_Time - Engineered Variable (Quantitative)
- TIMEINT - Time of computer use (mins)
- CNT - Country (Categorical)
- OECD - Is OECD or not (Categorical)
- ST04Q01 - Gender (Categorical)
- GRADE - Grade compared to modal grade in country (Quantitative).
- Overall_Literacy - Engineered Variable (Quantitative)
- Average_Learning_Time - Engineered Variable (Quantitative)
- ST93Q01 - Perseverance: Give up easily (Categorical)
- IC01Q01 to IC01Q07 - computer, internet, cellphone at home
- ST22Q01 - Acculturation: Mother Immigrant (Filter) (Categorical)
- HISCED - Highest educational level of parents (Categorical)
- HISEI - Highest parental occupational status (Quantitative)
- PARED - Highest parental education in years (Quantitative)
- REPEAT - Grade Repetition (class repeated) (Categorical)
- TIMEINT - Time of computer use (mins)
df_pisa = df_pisa[['CNT','OECD','ST04Q01','Overall_Literacy','GRADE','Average_Learning_Time','ST93Q01','IC01Q01','IC01Q02','IC01Q03','IC01Q04','IC01Q05','IC01Q06','IC01Q07','ST22Q01','HISCED','HISEI','PARED','REPEAT','TIMEINT']]
df_pisa.head()
df_pisa.rename(columns = {'CNT':'Country',
'GRADE' : 'Grade',
'ST04Q01' : 'Gender',
'ST93Q01' : 'Perseverance_Give_up_easily',
'IC01Q01' : 'Desktop_at_Home',
'IC01Q02' : 'Portable_Laptop_at_Home',
'IC01Q03' : 'Tablet_Computer_at_Home',
'IC01Q04' : 'Internet_Connection',
'IC01Q05' : 'Video_Games_Console',
'IC01Q06' : 'at_Home_Cell_phone_w/o_Internet',
'IC01Q07' : 'at_Home_Cell_phone_with_Internet',
'ST22Q01' : 'Mother_Immigrant',
'HISCED' : 'Highest_educational_level_parents',
'HISEI' : 'Highest_parental_occupational_status',
'PARED' : 'Highest_parental_education_years',
'REPEAT' : 'Class_repeated',
'TIMEINT' : 'Computer_use_mins'}, inplace = True)
df_pisa.head()
df_pisa.info()
Any column in which more than 60-70% of the values are missing will be removed. Hence, the 'Mother_Immigrant' column is removed.
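The threshold rule described above can be sketched as a small helper. This is an illustration on made-up data, not the code used in this analysis: the function name `drop_sparse_columns` and the exact cutoff of 0.6 are assumptions.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.6):
    """Drop columns whose share of missing values exceeds max_missing.

    Hypothetical helper for illustration; returns the reduced frame
    together with the per-column missing share.
    """
    missing_share = df.isnull().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > max_missing].index
    return df.drop(columns=to_drop), missing_share

# Invented data: one column is 80% missing, one 20%, one complete
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, np.nan],
    'partly_missing': [1.0, np.nan, 3.0, 4.0, 5.0],
    'complete': [1, 2, 3, 4, 5],
})

kept, missing_share = drop_sparse_columns(toy, max_missing=0.6)
print(missing_share * 100)   # missing percentage per column
print(list(kept.columns))    # columns that survive the cutoff
```

Computing the share with `isnull().mean()` avoids dividing by a hard-coded row count.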
# Percentage of missing values per column (isnull().mean() gives the missing share)
df_pisa.isnull().mean().sort_values(ascending=False)*100
df_pisa.isnull().sum().sort_values(ascending=False)
Dropping Mother_Immigrant Column due to excessive missing values
df_pisa = df_pisa.drop(['Mother_Immigrant'], axis = 1)
df_pisa.columns
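# .copy() makes the cleaned frame independent, so later column assignments modify it directly
df_pisa_final = df_pisa.dropna().copy()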
df_pisa_final.shape
print('The cleaned dataframe is composed of only', len(df_pisa_final)/len(df_pisa)*100, '% of the original dataset.')
Error 1: Removing '<>' from the values in the 'Class_repeated' column.
Error 2: Replacing the mis-encoded value 'Yes, but I donÂ’t use it' with 'Yes, but I don't use it' in the columns that contain it.
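Both fixes can be sketched on a toy frame. The values below are invented to mimic the two problems, and the names `bad`, `good` and `device_cols` are assumptions introduced for illustration. The regular-expression replace removes both angle brackets in one step, and a single `replace` call covers several columns at once.

```python
import pandas as pd

# The mis-encoded label and its intended form
bad = 'Yes, but I donÂ’t use it'
good = "Yes, but I don't use it"

# Invented rows that mimic the two data-quality problems
toy = pd.DataFrame({
    'Class_repeated': ['Did not repeat a <grade>', 'Repeated a <grade>'],
    'Desktop_at_Home': [bad, 'No'],
    'Internet_Connection': ['Yes, and I use it', bad],
})

# Error 1: strip both angle brackets with one character-class pattern
toy['Class_repeated'] = toy['Class_repeated'].str.replace('[<>]', '', regex=True)

# Error 2: fix the label in several columns at once, assigning the result back
device_cols = ['Desktop_at_Home', 'Internet_Connection']
toy[device_cols] = toy[device_cols].replace({bad: good})

print(toy)
```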
df_pisa_final.Class_repeated.value_counts()
df_pisa_final['Class_repeated'] = df_pisa_final['Class_repeated'].str.replace("<","")
df_pisa_final['Class_repeated'] = df_pisa_final.Class_repeated.str.strip('>')
df_pisa_final.Class_repeated.value_counts()
df_pisa_final.Desktop_at_Home.value_counts(dropna=False)
df_pisa_final["Desktop_at_Home"] = df_pisa_final["Desktop_at_Home"].replace({"Yes, but I donÂ’t use it": "Yes, but I don't use it"})
df_pisa_final.Desktop_at_Home.value_counts(dropna=False)
# Apply the same replacement to the remaining columns, assigning the result back
# instead of relying on chained in-place replacement
for col in ["Portable_Laptop_at_Home", "Tablet_Computer_at_Home", "Internet_Connection",
            "Video_Games_Console", "at_Home_Cell_phone_w/o_Internet",
            "at_Home_Cell_phone_with_Internet", "Highest_educational_level_parents",
            "Class_repeated"]:
    df_pisa_final[col] = df_pisa_final[col].replace({"Yes, but I donÂ’t use it": "Yes, but I don't use it"})
df_pisa_final.info()
df_pisa_final['OECD'] = df_pisa_final.OECD.astype('category')
df_pisa_final['OECD'].dtypes
df_pisa_final['Gender'] = df_pisa_final.Gender.astype('category')
df_pisa_final['Perseverance_Give_up_easily'] = df_pisa_final.Perseverance_Give_up_easily.astype('category')
df_pisa_final['Desktop_at_Home'] = df_pisa_final.Desktop_at_Home.astype('category')
df_pisa_final['Portable_Laptop_at_Home'] = df_pisa_final.Portable_Laptop_at_Home.astype('category')
df_pisa_final['Tablet_Computer_at_Home'] = df_pisa_final.Tablet_Computer_at_Home.astype('category')
df_pisa_final['Internet_Connection'] = df_pisa_final.Internet_Connection.astype('category')
df_pisa_final['Video_Games_Console'] = df_pisa_final.Video_Games_Console.astype('category')
df_pisa_final['at_Home_Cell_phone_w/o_Internet'] = df_pisa_final['at_Home_Cell_phone_w/o_Internet'].astype('category')
df_pisa_final['at_Home_Cell_phone_with_Internet'] = df_pisa_final.at_Home_Cell_phone_with_Internet.astype('category')
df_pisa_final['Highest_educational_level_parents'] = df_pisa_final.Highest_educational_level_parents.astype('category')
df_pisa_final['Class_repeated'] = df_pisa_final.Class_repeated.astype('category')
df_pisa_final.info()
df_pisa_final.Country.value_counts(dropna=False, ascending=False)
df_pisa_final.OECD.value_counts(dropna=False)
print(df_pisa_final.Grade.describe())
print('\n')
print(df_pisa_final.Grade.value_counts(dropna=False))
print(df_pisa_final.Overall_Literacy.describe())
print('\n')
print(df_pisa_final.Overall_Literacy.value_counts(dropna=False))
df_pisa_final.Perseverance_Give_up_easily.value_counts(dropna=False)
df_pisa_final.Desktop_at_Home.value_counts(dropna=False)
df_pisa_final.Portable_Laptop_at_Home.value_counts(dropna=False)
df_pisa_final.Tablet_Computer_at_Home.value_counts(dropna=False)
df_pisa_final.Internet_Connection.value_counts(dropna=False)
df_pisa_final.Video_Games_Console.value_counts(dropna=False)
df_pisa_final['at_Home_Cell_phone_w/o_Internet'].value_counts(dropna=False)
df_pisa_final.at_Home_Cell_phone_with_Internet.value_counts(dropna=False)
df_pisa_final.Highest_educational_level_parents.value_counts(dropna=False)
print(df_pisa_final.Highest_parental_occupational_status.describe())
print('\n')
print(df_pisa_final.Highest_parental_occupational_status.value_counts(dropna=False))
df_pisa_final.Highest_parental_education_years.value_counts(dropna=False)
df_pisa_final.Class_repeated.value_counts(dropna=False)
print(df_pisa_final[['Average_Learning_Time']].describe())
print('\n')
print(df_pisa_final.Average_Learning_Time.value_counts(dropna=False))
print(df_pisa_final.Computer_use_mins.describe())
print('\n')
print(df_pisa_final.Computer_use_mins.value_counts(dropna=False))
In this section, I will be investigating distributions of individual variables. I will also be monitoring any unusual points or outliers, and will further investigate to clean things up. The cleaning process will help to look at relationships between variables.
plt.figure(figsize=(15,16))
sb.countplot(data=df_pisa_final, y= 'Country', color='darkblue', order = df_pisa_final['Country'].value_counts(ascending=False).index);
The variable investigated above is the students' country of residence. A count plot shows which countries have the most entries: Mexico has the most students and Liechtenstein the fewest in the cleaned PISA data. I plan on investigating how country of residence affects the overall literacy of students.
df_pisa_final[['Grade']].describe()
sb.countplot(data=df_pisa_final, x='Grade', color='darkblue');
Grade is the variable analyzed here. There are no outliers, as the data is tidy. I plan on investigating Grade, Overall_Literacy and other factors in conjunction.
df_pisa_final[['Overall_Literacy']].describe()
bin_edges = np.arange(df_pisa_final.Overall_Literacy.min(),df_pisa_final.Overall_Literacy.max()+50,50)
plt.hist(data=df_pisa_final, x='Overall_Literacy', bins= bin_edges, color='darkblue');
plt.xlabel('Overall Literacy');
plt.ylabel('Frequency');
Overall_Literacy is the variable analyzed here. The data appears approximately normally distributed, with the peak around the 500 interval. I plan on using Overall_Literacy as the dependent variable.
df_pisa_final[['Average_Learning_Time']].describe()
sb.boxplot(y=df_pisa_final['Average_Learning_Time'], color='darkred');
The following code is taken from here
def subset_by_iqr(df, column, whisker_width=1.5):
    """Remove outliers from a dataframe by column, including optional
    whiskers, removing rows for which the column value is
    less than Q1-1.5IQR or greater than Q3+1.5IQR.
    Args:
        df (`:obj:pd.DataFrame`): A pandas dataframe to subset
        column (str): Name of the column to calculate the subset from.
        whisker_width (float): Optional, loosen the IQR filter by a
            factor of `whisker_width` * IQR.
    Returns:
        (`:obj:pd.DataFrame`): Filtered dataframe
    """
    # Calculate Q1, Q3 and IQR
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # Keep rows within [Q1 - whisker_width*IQR, Q3 + whisker_width*IQR]
    mask = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr)
    return df.loc[mask]
# Remove outliers using the conventional whisker width of 1.5
df_pisa_final = subset_by_iqr(df_pisa_final, 'Average_Learning_Time', whisker_width=1.5)
Average Learning Time (mins/week) is the variable analyzed here. The boxplot shows that there are a lot of outliers, which were removed using the function defined above.
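To make the IQR rule concrete, here is a small worked example on invented numbers (the series `times` is an assumption for illustration, with 900 placed as a deliberate outlier). It computes the two fences by hand and shows which values survive.

```python
import pandas as pd

# Invented weekly learning times in minutes; 900 is a deliberate outlier
times = pd.Series([150, 180, 200, 210, 220, 240, 260, 900])

q1 = times.quantile(0.25)   # first quartile: 195.0
q3 = times.quantile(0.75)   # third quartile: 245.0
iqr = q3 - q1               # interquartile range: 50.0

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr      # 120.0
upper = q3 + 1.5 * iqr      # 320.0

# Keep only the values inside the fences
kept = times[(times >= lower) & (times <= upper)]
print(lower, upper)
print(kept.tolist())
```

Only the value 900 falls outside the interval from 120 to 320, so it is the single row removed.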
sb.boxplot(y=df_pisa_final['Average_Learning_Time'], color='darkred');
df_pisa_final[['Average_Learning_Time']].describe()
bin_edges = np.arange(df_pisa_final.Average_Learning_Time.min(),df_pisa_final.Average_Learning_Time.max()+10,30)
plt.hist(data=df_pisa_final, x='Average_Learning_Time', bins= bin_edges, color='darkblue');
plt.xlabel('Average Learning Time (mins/week)');
plt.ylabel('Frequency');
After cleaning the data, the distribution of Average Learning Time appears approximately normal, with the peak at around the 200 mark. I plan on investigating how Average Learning Time affects the overall literacy of students.
df_pisa_final.Highest_parental_occupational_status.describe()
bin_edges = np.arange(df_pisa_final.Highest_parental_occupational_status.min(),df_pisa_final.Highest_parental_occupational_status.max()+5,10)
plt.hist(data=df_pisa_final, x='Highest_parental_occupational_status', bins= bin_edges, color='darkblue');
plt.xlabel('Highest Parental Occupational Status');
plt.ylabel('Frequency');
Highest parental occupational status is the variable analyzed here. The data is almost evenly spread out, with one peak in the 20-30 interval. I plan on investigating how Highest parental occupational status affects the overall literacy of students.
df_pisa_final.Highest_parental_education_years.describe()
bin_edges = np.arange(0,df_pisa_final.Highest_parental_education_years.max()+2,2.5)
plt.hist(data=df_pisa_final, x='Highest_parental_education_years', bins= bin_edges, color='darkblue');
plt.xlabel('Highest Parental Education (years)');
plt.ylabel('Frequency');
The variable explored here is Highest Parental Education in years. Its distribution is left skewed, with the peak in the 15-17.5 years interval. I plan on investigating how Highest Parental Education affects the overall literacy of students.
df_pisa_final[['Computer_use_mins']].describe()
sb.boxplot(y=df_pisa_final['Computer_use_mins'], color='darkred');
bin_edges = np.arange(0,df_pisa_final.Computer_use_mins.max()+20,20)
plt.hist(data=df_pisa_final, x='Computer_use_mins', bins= bin_edges, color='darkblue');
plt.ylabel('Frequency');
plt.xlabel('Computer Use in (mins)');
The variable explored here is Computer Use in minutes. Its distribution is right skewed, with the peak in the 25-50 mins interval. I plan on investigating how Computer use affects the overall literacy of students.
df_pisa_final.info()
fig, ax = plt.subplots(4,3, figsize=(27, 28))
sb.countplot(data=df_pisa_final, x= 'OECD', color='orange', ax=ax[0][0]);
ax[0][0].title.set_text('Plot: 1')
sb.countplot(data=df_pisa_final, x= 'Gender', color='darkred', ax=ax[0][1]);
ax[0][1].title.set_text('Plot: 2')
sb.countplot(data=df_pisa_final, x= 'Perseverance_Give_up_easily', color='green', ax=ax[0][2]);
ax[0][2].title.set_text('Plot: 3')
ax[0][2].tick_params(labelrotation=10);
sb.countplot(data=df_pisa_final, x= 'Desktop_at_Home', color='orange', ax=ax[1][0]);
ax[1][0].title.set_text('Plot: 4')
sb.countplot(data=df_pisa_final, x= 'Portable_Laptop_at_Home', color='darkred', ax=ax[1][1]);
ax[1][1].title.set_text('Plot: 5')
sb.countplot(data=df_pisa_final, x= 'Tablet_Computer_at_Home', color='green', ax=ax[1][2]);
ax[1][2].title.set_text('Plot: 6')
sb.countplot(data=df_pisa_final, x= 'Internet_Connection', color='orange', ax=ax[2][0]);
ax[2][0].title.set_text('Plot: 7')
sb.countplot(data=df_pisa_final, x= 'Video_Games_Console', color='darkred', ax=ax[2][1]);
ax[2][1].title.set_text('Plot: 8')
sb.countplot(data=df_pisa_final, x= 'at_Home_Cell_phone_w/o_Internet', color='green', ax=ax[2][2]);
ax[2][2].title.set_text('Plot: 9')
sb.countplot(data=df_pisa_final, x= 'at_Home_Cell_phone_with_Internet', color='orange', ax=ax[3][0]);
ax[3][0].title.set_text('Plot: 10')
sb.countplot(data=df_pisa_final, x= 'Highest_educational_level_parents', color='darkred', ax=ax[3][1]);
ax[3][1].title.set_text('Plot: 11')
ax[3][1].tick_params(labelrotation=15)
sb.countplot(data=df_pisa_final, x= 'Class_repeated', color='green', ax=ax[3][2]);
ax[3][2].title.set_text('Plot: 12')
Plot 1: A count plot of students from OECD nations versus non-OECD nations. There are more students from OECD nations than from non-OECD ones. I plan on investigating how this variable affects the overall literacy of students.
Plot 2: A count plot of male versus female students. There are marginally more female students than male students. I plan on investigating this variable together with the overall literacy of students.
Plot 3: A count plot of students by perseverance level. The majority of students do not give up easily. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 4: A count plot of students who have a desktop at home. The majority of students have a desktop at home and use it. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 5: A count plot of students who have a portable laptop at home. The majority of students have a portable laptop at home and use it. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 6: A count plot of students who have a tablet computer at home. The majority of students do not have a tablet computer at home. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 7: A count plot of students who have an internet connection. The majority of students have an internet connection and use it. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 8: A count plot of students who have a video games console at home. The majority of students have a video games console at home and use it. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 9: A count plot of students who have a cellphone without internet at home. The majority of students have a cellphone without internet at home and use it. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 10: A count plot of students who have a cellphone with internet at home. The majority of students have a cellphone with internet at home and use it. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 11: A count plot of the parents' highest educational level. The majority of parents have at least completed ISCED level 5A or 6. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
Plot 12: A count plot of students who repeated a grade versus those who did not. The majority of students did not repeat a grade. I plan on investigating how this variable correlates with the overall literacy of students and other quantitative variables.
In this section, I investigate relationships between pairs of the variables introduced in the previous section.
df_pisa_final.info()
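# Sample row positions from the actual frame length rather than a hard-coded row count
chosen_ids = np.random.choice(df_pisa_final.shape[0], replace=False, size=10992)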
df_trimmed = df_pisa_final.iloc[chosen_ids]
df_trimmed.shape
plt.figure(figsize=(20,16))
result = df_pisa_final.groupby("Country")['Overall_Literacy'].mean().reset_index().sort_values('Overall_Literacy', ascending=False)
sb.barplot(x='Overall_Literacy', y="Country", data=df_pisa_final, order=result['Country'], color='darkred');
Students in China have the highest mean overall literacy of all nations in the data, whereas students in Jordan have the lowest.
Here, the variables Average Learning Time and Overall Literacy are analyzed. There is a weak positive correlation between the two variables. A random sample was plotted because using the entire dataframe caused heavy overplotting.
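A random subsample for plotting can also be drawn with pandas directly. The sketch below uses synthetic data (the frame `toy` and its distribution parameters are invented), and the `random_state` value is an arbitrary assumption that simply makes the sample repeatable between runs.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data; values are invented
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Average_Learning_Time': rng.normal(210, 40, size=5000),
    'Overall_Literacy': rng.normal(500, 90, size=5000),
})

# Draw 500 rows without replacement; a fixed random_state gives the same rows each run
sample = toy.sample(n=500, replace=False, random_state=42)

print(sample.shape)
```

Fixing the seed means the scatter plots look the same every time the notebook is rerun, which makes the written observations easier to check.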
(df_pisa_final['Average_Learning_Time'].corr(df_pisa_final['Overall_Literacy'])).round(3)
plt.figure(figsize = [10, 6])
sb.regplot(data = df_trimmed, y = 'Overall_Literacy', x = 'Average_Learning_Time', x_jitter=0.4, scatter_kws={'alpha':.1}, fit_reg=False );
plt.xlabel('Average Learning Time in (minutes/week)');
plt.ylabel('Overall Literacy');
Here, the variables Highest Parental Occupational Status and Overall Literacy are analyzed. There is a weak to moderate positive correlation between the two variables. A random sample was plotted because using the entire dataframe caused heavy overplotting.
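For reference, the coefficient returned by `Series.corr` is, by default, the Pearson correlation: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on a handful of invented values (`x` and `y` are assumptions, not PISA data).

```python
import numpy as np
import pandas as pd

# Invented values standing in for occupational status and literacy
x = pd.Series([20.0, 35.0, 50.0, 65.0, 80.0])
y = pd.Series([430.0, 480.0, 470.0, 520.0, 540.0])

# Pearson r from its definition: sum of products of deviations,
# divided by the square root of the product of the sums of squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (method='pearson' is the default)
r_pandas = x.corr(y)

print(round(r_manual, 3), round(r_pandas, 3))
```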
(df_pisa_final['Highest_parental_occupational_status'].corr(df_pisa_final['Overall_Literacy'])).round(3)
plt.figure(figsize = [10, 6])
sb.regplot(data = df_trimmed, y = 'Overall_Literacy', x = 'Highest_parental_occupational_status', y_jitter=0, scatter_kws={'alpha':.1}, fit_reg=False );
plt.xlabel('Highest Parental Occupational Status');
plt.ylabel('Overall Literacy');
Here, the variables Highest Parental Education (measured in years) and Overall Literacy are analyzed. There is a weak to moderate positive correlation between the two variables. A bar plot was used instead of a scatter plot because the education variable, although numeric, takes only a few discrete values. The y-axis starts at 300 to make the differences between the education levels easier to see.
(df_pisa_final['Highest_parental_education_years'].corr(df_pisa_final['Overall_Literacy'])).round(3)
plt.figure(figsize = [20, 10])
base_color = sb.color_palette()[1]
sb.barplot(data=df_pisa_final, x='Highest_parental_education_years', y='Overall_Literacy', color=base_color, ci='sd')#, order=comb_order)
plt.xticks(rotation=0);
plt.ylim(300,650)
plt.xlabel('Highest Parental Education Years');
plt.ylabel('Overall Literacy');
Here, the variables Grade (measured on a scale of -3 to 2) and Overall Literacy are analyzed. The relative grade index (GRADE) was computed to capture between-country variation. It indicates whether students are in the country's modal grade (value of 0) or whether they are below or above the modal grade (-x grades, +x grades). There is a strong positive correlation between the two variables. A bar plot was used instead of a scatter plot because Grade takes only a few discrete values.
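The quantities a bar plot with standard-deviation error bars displays, namely the mean and the spread of literacy within each grade level, can also be tabulated directly. The following is a sketch on invented values; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Invented literacy scores for three relative grade levels
toy = pd.DataFrame({
    'Grade': [-1, -1, 0, 0, 0, 1, 1],
    'Overall_Literacy': [410.0, 430.0, 480.0, 500.0, 520.0, 540.0, 560.0],
})

# Mean, standard deviation and group size per grade level
summary = toy.groupby('Grade')['Overall_Literacy'].agg(['mean', 'std', 'count'])

print(summary)
```

Including the `count` column helps when reading the plot, since levels with very few students have less reliable means.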
base_color = sb.color_palette()[1]
sb.barplot(data=df_pisa_final, x='Grade', y='Overall_Literacy', color=base_color, ci='sd')
plt.xlabel('Grade');
plt.ylabel('Overall Literacy');
Here, the variables Computer Use (minutes) and Overall Literacy are analyzed. There is a weak negative correlation between the two variables. A random sample was plotted because using the entire dataframe caused heavy overplotting.
(df_pisa_final['Computer_use_mins'].corr(df_pisa_final['Overall_Literacy'])).round(3)
plt.figure(figsize = [10, 6])
sb.regplot(data = df_trimmed, y = 'Overall_Literacy', x = 'Computer_use_mins', x_jitter=0.2, scatter_kws={'alpha':.1}, fit_reg=False );
plt.xlabel('Computer Use minutes');
plt.ylabel('Overall Literacy');
To analyze the categorical factors, the initial setup shown below was done. All categorical factors were plotted against Overall Literacy to evaluate the effect of each factor on it. Only those factors that show a relationship with overall literacy are discussed below.
ordinal_var_dict = {'OECD': ['OECD','Non-OECD'],
'Gender': ['Female','Male'],
'Perseverance_Give_up_easily': ['Not at all like me', 'Not much like me','Somewhat like me','Mostly like me', 'Very much like me'],
'Desktop_at_Home': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'Portable_Laptop_at_Home': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'Tablet_Computer_at_Home': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'Internet_Connection': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'Video_Games_Console': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'at_Home_Cell_phone_w/o_Internet': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'at_Home_Cell_phone_with_Internet': ['No',"Yes, but I don't use it",'Yes, and I use it'],
'Highest_educational_level_parents': ['ISCED 1','ISCED 2','ISCED 3A, ISCED 4', 'ISCED 3B, C', 'ISCED 5B', 'ISCED 5A, 6'],
'Class_repeated': ['Did not repeat a grade','Repeated a grade']}
for var in ordinal_var_dict:
    pd_ver = pd.__version__.split(".")
    if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # v0.21 or later
        ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                    categories = ordinal_var_dict[var])
        df_pisa_final[var] = df_pisa_final[var].astype(ordered_var)
    else: # pre-v0.21
        df_pisa_final[var] = df_pisa_final[var].astype('category', ordered = True,
                                                       categories = ordinal_var_dict[var])
Here, the variables Average Learning Time (minutes/week) and Class Repeated are analyzed. Based on the plot, there is no notable difference in average learning time between students who repeated a grade and those who did not.
sb.barplot(x='Class_repeated', y="Average_Learning_Time", data=df_pisa_final, color='darkred');
fig, ax = plt.subplots(4,3, figsize=(27, 28))
sb.violinplot(data=df_pisa_final, x='OECD', y='Overall_Literacy', color=sb.color_palette()[3], inner='quartile',ax=ax[0][0])
ax[0][0].title.set_text('Plot 1: OECD and Overall Literacy')
sb.violinplot(data=df_pisa_final, x='Gender', y='Overall_Literacy', color=sb.color_palette()[1], inner='quartile',ax=ax[0][1])
ax[0][1].title.set_text('Plot 2')
sb.violinplot(data=df_pisa_final, x='Class_repeated', y='Overall_Literacy', color=sb.color_palette()[2], inner='quartile',ax=ax[0][2])
ax[0][2].title.set_text('Plot 3: Class Repeated and Overall Literacy')
ax[0][2].tick_params(labelrotation=0);
sb.violinplot(data=df_pisa_final, x='Desktop_at_Home', y='Overall_Literacy', color=sb.color_palette()[3], inner='quartile',ax=ax[1][0])
ax[1][0].title.set_text('Plot 4')
sb.violinplot(data=df_pisa_final, x='Portable_Laptop_at_Home', y='Overall_Literacy', color=sb.color_palette()[1], inner='quartile',ax=ax[1][1])
ax[1][1].title.set_text('Plot 5')
sb.violinplot(data=df_pisa_final, x='Tablet_Computer_at_Home', y='Overall_Literacy', color=sb.color_palette()[2], inner='quartile',ax=ax[1][2])
ax[1][2].title.set_text('Plot 6')
sb.violinplot(data=df_pisa_final, x='Internet_Connection', y='Overall_Literacy', color=sb.color_palette()[3], inner='quartile',ax=ax[2][0])
ax[2][0].title.set_text('Plot 7: Internet Connection and Overall Literacy')
sb.violinplot(data=df_pisa_final, x='Video_Games_Console', y='Overall_Literacy', color=sb.color_palette()[1], inner='quartile',ax=ax[2][1])
ax[2][1].title.set_text('Plot 8')
sb.violinplot(data=df_pisa_final, x='at_Home_Cell_phone_w/o_Internet', y='Overall_Literacy', color=sb.color_palette()[2], inner='quartile',ax=ax[2][2])
ax[2][2].title.set_text('Plot 9')
sb.violinplot(data=df_pisa_final, x='at_Home_Cell_phone_with_Internet', y='Overall_Literacy', color=sb.color_palette()[3], inner='quartile',ax=ax[3][0])
ax[3][0].title.set_text('Plot 10')
sb.violinplot(data=df_pisa_final, x='Highest_educational_level_parents', y='Overall_Literacy', color=sb.color_palette()[1], inner='quartile',ax=ax[3][1])
ax[3][1].title.set_text('Plot 11: Highest Educational Level of Parents and Overall Literacy')
ax[3][1].tick_params(labelrotation=15)
sb.violinplot(data=df_pisa_final, x='Perseverance_Give_up_easily', y='Overall_Literacy', color=sb.color_palette()[2], inner='quartile',ax=ax[3][2])
ax[3][2].title.set_text('Plot 12: Perseverance and Overall Literacy')
ax[3][2].tick_params(labelrotation=15);
Observations:
Plot 1: A violin plot with the OECD variable on the x-axis and overall literacy on the y-axis. Students from OECD nations and students from non-OECD nations seem to have more or less similar overall literacy.
Plot 3: A violin plot with grade repetition (repeated a grade vs did not repeat a grade) on the x-axis and overall literacy on the y-axis. Students who did not repeat a grade did far better in overall literacy than students who did repeat a grade.
Plot 4 & 5: These two violin plots are nearly identical, so they are best discussed together. Plot 4 shows how having a desktop at home relates to overall literacy. Plot 5 shows the same for a portable laptop at home. The x-axis holds 3 choices, i) 'No', ii) "Yes, but I don't use it", iii) 'Yes, and I use it', and the y-axis holds the overall literacy values. Students who had a desktop and/or a portable laptop at home did far better in overall literacy than students who did not have them.
Plot 7: A violin plot showing how having an internet connection relates to overall literacy. The x-axis holds 3 choices, i) 'No', ii) "Yes, but I don't use it", iii) 'Yes, and I use it', and the y-axis holds the overall literacy values. Students who have an internet connection and use it did far better in overall literacy than students who do not have one or who have one but do not use it.
Plot 11: A violin plot with the parents' highest educational level on the x-axis and the students' overall literacy on the y-axis. As the parents' highest educational level increases, the students' overall literacy increases as well.
Plot 12: A violin plot with the students' perseverance level on the x-axis and their overall literacy on the y-axis. As the perseverance level decreases (that is, as the tendency to give up increases), the students' overall literacy decreases.
In this section plots of three or more variables were created to investigate the data even further.
Country vs Overall Literacy
Average Learning Time vs Overall Literacy (corr = 0.125)
Highest Parental Occupational Status vs Overall Literacy (corr = 0.351)
Highest Parental Education in years vs Overall Literacy (corr = 0.277)
Class Repeated and Overall Literacy
Internet Connection and Overall Literacy
Highest Educational Level of Parents and Overall Literacy
Perseverance and Overall Literacy
In this plot, the variables investigated are shown below:
- Highest Parental Occupational Status (Quantitative)
- Overall Literacy (Quantitative)
- Internet Connection (Categorical)
With 2 quantitative variables and 1 categorical variable, a scatter plot with color encoding is a good fit. Because of overplotting, a random sample of the data was used so that the trend is easier to see. The x-axis shows "Highest Parental Occupational Status", the y-axis shows "Overall Literacy", and color encodes "Internet Connection".
As the parents' occupational status increases, so does the student's overall literacy, and more and more students have access to the internet. The right side of the plot is dominated by students who use the internet. Conversely, as the parents' occupational status decreases, so does the student's overall literacy, and fewer and fewer students have access to the internet.
# `height` replaces the older `size` argument, which newer seaborn versions no longer accept
g = sb.FacetGrid(data = df_trimmed, hue = 'Internet_Connection', height = 15, aspect=1.5)
g.map(plt.scatter, 'Highest_parental_occupational_status', 'Overall_Literacy', alpha=.8)
plt.title("Parent's Highest Occupational Status and its effect on Overall Literacy of Students given the Internet Accessibility")
plt.xlabel('Highest Parental Occupational Status')
plt.ylabel('Overall Literacy')
g.add_legend(title ='Internet Connection');
plt.rcParams.update({'font.size': 17});
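The random sampling mentioned above can be done with `DataFrame.sample`. The sketch below is only an illustration on synthetic data: the helper name `sample_for_plot`, the sample size and the seed are assumptions, not the settings actually used to build `df_trimmed`.

```python
import numpy as np
import pandas as pd

def sample_for_plot(df, n=5000, seed=42):
    """Return at most n randomly chosen rows (hypothetical helper)."""
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed)

# Synthetic stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Highest_parental_occupational_status': rng.uniform(10, 90, size=20000),
    'Overall_Literacy': rng.normal(470, 90, size=20000),
})

demo_sample = sample_for_plot(demo, n=5000)
print(demo_sample.shape)
```

Fixing `random_state` makes the sample reproducible, so the scatter plot looks the same on every run. Lowering `alpha` is a complementary way to reduce overplotting.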
In this plot, the variables investigated are shown below:
- Highest Parental Occupational Status (Quantitative)
- Overall Literacy (Quantitative)
- Class Repeated (Categorical)
With 2 quantitative variables and 1 categorical variable, a scatter plot with color encoding is a good fit. Because of overplotting, a random sample of the data was used so that the trend is easier to see. The x-axis shows "Highest Parental Occupational Status", the y-axis shows "Overall Literacy", and color encodes "Class Repeated".
As the parents' occupational status increases, so does the student's overall literacy, and fewer and fewer students have repeated a class. The right side of the plot is dominated by students who did not repeat a class. Conversely, as the parents' occupational status decreases, so does the student's overall literacy, and more and more students have repeated a grade.
# `height` replaces the older `size` argument, which newer seaborn versions no longer accept
g = sb.FacetGrid(data = df_trimmed, hue = 'Class_repeated', height = 16, aspect=1.5)
g.map(plt.scatter, 'Highest_parental_occupational_status', 'Overall_Literacy', alpha=1)
plt.title("Parent's Highest Occupational Status and its effect on Overall Literacy of Students and Repeated Class")
plt.xlabel('Highest Parental Occupational Status')
plt.ylabel('Overall Literacy')
g.add_legend(title ='Class Repeated?');
plt.rcParams.update({'font.size': 16});
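The claim that fewer students repeat a class as parental occupational status rises is hard to read off a dense scatter plot. One way to check it numerically is to bin the status variable and compute the share of repeaters per bin. The sketch below uses synthetic data with a built-in downward trend; the bin edges and the boolean column `repeated` are assumptions chosen for illustration, not columns of `df_trimmed`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5000
status = rng.uniform(10, 90, size=n)

# Synthetic assumption: repeating becomes less likely as status rises.
p_repeat = 0.35 - 0.003 * status
repeated = rng.random(n) < p_repeat

demo = pd.DataFrame({
    'Highest_parental_occupational_status': status,
    'repeated': repeated,
})

# Cut the status scale into four equal-width bins (edges are illustrative).
demo['status_bin'] = pd.cut(
    demo['Highest_parental_occupational_status'],
    bins=[10, 30, 50, 70, 90],
    include_lowest=True,
)

# The mean of a boolean column is the share of True values in each bin.
repeat_share = demo.groupby('status_bin', observed=True)['repeated'].mean()
print(repeat_share.round(3))
```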
In this plot, the variables investigated are shown below:
- Highest Parental Occupational Status (Quantitative)
- Overall Literacy (Quantitative)
- Grade (Quantitative)
With 3 quantitative variables, a scatter plot with a color scale is a good fit. Because of overplotting, a random sample of the data was used so that the trend is easier to see. The x-axis shows "Highest Parental Occupational Status", the y-axis shows "Overall Literacy", and the color bar on the right represents "Grade".
The variable Grade is expressed relative to the modal grade in the country. This relative grade index was computed to capture between-country variation. It indicates whether students are in the country's modal grade (value of 0) or whether they are below or above the modal grade (-x grades, +x grades).
As the parents' occupational status increases, so does the student's overall literacy, and students tend to be in their country's modal grade, with a few above it. The right side of the plot is dominated by students who are in the country's modal grade or above it. Conversely, as the parents' occupational status decreases, so does the student's overall literacy, and more and more students are below the modal grade.
plt.figure(figsize = [24, 16])
plt.scatter(data = df_trimmed, x = 'Highest_parental_occupational_status', y = 'Overall_Literacy', c = 'Grade',
cmap = 'icefire_r', alpha=1)
plt.title("Parent's Highest Occupational Status and its effect on Overall Literacy of Students and Grade")
plt.xlabel('Highest Parental Occupational Status')
plt.ylabel('Overall Literacy')
plt.colorbar(label = 'Grade');
#mako_r
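To make the relative grade index described above concrete, the following sketch shows one plausible way to derive such an index from raw grades: take the most common grade per country and subtract it from each student's grade. The column names and the toy records are assumptions; this is not the official PISA computation.

```python
import pandas as pd

# Made-up records: country code and the grade each student attends.
students = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'grade':   [10, 10, 9, 11, 9, 9, 10],
})

# Modal (most common) grade per country, broadcast back to every row.
modal_grade = students.groupby('country')['grade'].transform(
    lambda s: s.mode().iloc[0]
)

# Relative grade: 0 = at the modal grade, negative = below, positive = above.
students['relative_grade'] = students['grade'] - modal_grade
print(students)
```

In this toy data the modal grade is 10 in country A and 9 in country B, so a student in grade 9 is one grade below the mode in A but exactly at the mode in B.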
In this plot, the variables investigated are shown below:
- Highest Parental Occupational Status (Quantitative)
- Overall Literacy (Quantitative)
- Highest Parental Education in years (Quantitative)
With 3 quantitative variables, a scatter plot with a color scale is a good fit. Because of overplotting, a random sample of the data was used so that the trend is easier to see. The x-axis shows "Highest Parental Occupational Status", the y-axis shows "Overall Literacy", and the color bar represents "Highest Parental Education in years".
The plot shows that parents' occupational status is positively correlated with highest parental education, and that an increase in both variables goes along with an increase in the students' overall literacy. On the right side of the plot, students' overall literacy tends to be highest where parental occupational status and parents' years of education are at their maximum.
plt.figure(figsize = [25, 16])
plt.scatter(data = df_trimmed, x = 'Highest_parental_occupational_status', y = 'Overall_Literacy', c = 'Highest_parental_education_years',
cmap = 'mako_r', alpha=1)
plt.title("Parent's Highest Occupational Status and its effect on Overall Literacy of Students and Highest Parental Education")
plt.xlabel('Highest Parental Occupational Status')
plt.ylabel('Overall Literacy')
plt.colorbar(label = 'Highest Parental Education (year)');
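The pairwise relationships read off the plot can also be summarized with a correlation matrix. The sketch below uses synthetic data in which the three variables are positively related by construction; the coefficients used to generate it are arbitrary assumptions and the resulting values say nothing about the real PISA data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Synthetic variables with a built-in positive relationship (made-up numbers).
education_years = rng.integers(6, 19, size=n).astype(float)
occupation = 20 + 3.0 * education_years + rng.normal(0, 10, size=n)
literacy = 300 + 2.0 * occupation + rng.normal(0, 60, size=n)

demo = pd.DataFrame({
    'Highest_parental_education_years': education_years,
    'Highest_parental_occupational_status': occupation,
    'Overall_Literacy': literacy,
})

# Pearson correlation between every pair of columns.
corr = demo.corr(method='pearson')
print(corr.round(3))
```

On the real data the same call would be applied to the three columns of `df_trimmed` or `df_pisa_final`.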
In this plot, the variables investigated are shown below:
- Class Repeated (Categorical)
- Overall Literacy (Quantitative)
- Perseverance (Categorical)
With 2 categorical variables and 1 quantitative variable, a clustered bar plot is a good fit. The x-axis shows "Class Repeated", the y-axis shows "Overall Literacy", and the color encoding (legend on the right) represents the "Perseverance" scale.
The plot shows that students who do not give up easily tend to have the highest overall literacy, provided they did not repeat a grade/class. Within each "Class Repeated" category a similar downward staircase pattern runs from the leftmost bar to the rightmost bar: the more easily students give up, the lower their overall literacy. This plot motivates investigating another relationship, namely the count of students in each combination of perseverance level and class-repeated category. That plot can be found below (Plot 1B).
plt.figure(figsize = [15, 12])
ax = sb.barplot(data = df_pisa_final, x = 'Class_repeated', y = 'Overall_Literacy', hue = 'Perseverance_Give_up_easily', palette='viridis')
ax.legend(loc = 1, ncol = 1, framealpha = 1, title = 'Perseverance (Give up easily)')
plt.title('Plot 1A: Effect of Perseverance on Class Repeated and Overall Literacy')
plt.xlabel('Class Repeated')
plt.ylabel('Overall Literacy')
ax.set_ylim([320,550]);
In this plot, the variables investigated are shown below:
- Class Repeated (Categorical)
- Count of Students (Quantitative)
- Perseverance (Categorical)
Continuing from the previous plot, this plot has the same configuration except that the y-axis holds the "Count of Students" variable.
There is an interesting observation here. One would normally expect students who give up easily to be the ones who repeat a grade/class. According to the survey results shown below, however, the count of students who give up easily and repeated a grade is much smaller than the count of those who give up easily and did not repeat a grade. This may partly reflect group sizes: if many more students did not repeat a grade, raw counts will be higher in that group regardless of perseverance. Comparing proportions within each "Class Repeated" group, and gathering more data, would allow this factor to be evaluated more thoroughly.
plt.figure(figsize = [15, 10])
ax = sb.countplot(data = df_pisa_final, x = 'Class_repeated', hue = 'Perseverance_Give_up_easily', palette='viridis')
ax.legend(loc = 1, ncol = 1, framealpha = 1, title = 'Perseverance (Give up easily)');
plt.title('Plot 1B: Effect of Perseverance on Class Repeated and Count of Students');
plt.xlabel('Class Repeated')
plt.ylabel('Count of Students');
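Raw counts like those in Plot 1B depend on how many students fall into each "Class Repeated" group. A row-normalized cross-tabulation shows the share of each perseverance level within each group instead. The sketch below uses a tiny made-up table; the category labels ('No', 'Yes', 'Like me', 'Not like me') are placeholders and may not match the labels in `df_pisa_final`.

```python
import pandas as pd

# Made-up data: 8 students did not repeat a grade, 2 did.
demo = pd.DataFrame({
    'Class_repeated': ['No'] * 8 + ['Yes'] * 2,
    'Perseverance_Give_up_easily': (
        ['Like me'] * 2 + ['Not like me'] * 6   # non-repeaters
        + ['Like me'] * 1 + ['Not like me'] * 1  # repeaters
    ),
})

# Raw counts per combination, as a countplot would show them.
counts = pd.crosstab(demo['Class_repeated'], demo['Perseverance_Give_up_easily'])

# Shares within each Class_repeated group (each row sums to 1).
shares = pd.crosstab(
    demo['Class_repeated'],
    demo['Perseverance_Give_up_easily'],
    normalize='index',
)

print(counts)
print(shares)
```

In this toy example more non-repeaters than repeaters give up easily in absolute terms (2 vs. 1), yet the share is higher among repeaters (0.5 vs. 0.25), which is the kind of base-rate effect that counts alone can hide.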
In this plot, the variables investigated are shown below:
- Class Repeated (Categorical)
- Overall Literacy (Quantitative)
- Highest Educational Level of Parents (Categorical)
With 2 categorical variables and 1 quantitative variable, a clustered bar plot is a good fit. The x-axis shows "Class Repeated", the y-axis shows "Overall Literacy", and the color encoding (legend on the right) represents the "Highest Educational Level of Parents" scale.
The plot shows that students whose parents have a high education level tend to have higher overall literacy, provided those students did not repeat a grade. A similar pattern holds for those who did repeat a grade.
A surprising observation concerns students who did not repeat a grade: those whose parents had at least an ISCED 3, ISCED 4 education level had higher overall literacy than those whose parents had ISCED 3B, C and even ISCED 5B, as shown in the plot below. One would expect students' overall literacy to rise with the parents' education level, but the plot does not follow that ordering strictly. A similar pattern appears among students who did repeat a grade: there, students whose parents had at least an ISCED 3B education level have higher overall literacy than those whose parents had an ISCED 5B education level.
plt.figure(figsize = [13, 10])
ax = sb.barplot(data = df_pisa_final, x = 'Class_repeated', y = 'Overall_Literacy', hue = 'Highest_educational_level_parents', palette='viridis_r')
ax.legend(loc = 1, ncol = 1, framealpha = 1, title = 'Highest Educational Level Parents')
plt.title("Effect of Parent's Educational Level on Class Repetition and Overall Literacy")
plt.xlabel('Class Repeated')
plt.ylabel('Overall Literacy')
ax.set_ylim([350,550]);
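Before reading much into an unexpected ordering of bars, it can help to look at the group means together with the group sizes and standard errors, since small groups produce noisy means. The sketch below is a minimal illustration on synthetic data with no real effect built in; the category labels are placeholders, not the exact labels used in `df_pisa_final`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
levels = ['ISCED 2', 'ISCED 3A, ISCED 4', 'ISCED 5B']  # placeholder labels

# Synthetic data: literacy is drawn independently of both categories.
demo = pd.DataFrame({
    'Class_repeated': rng.choice(['No', 'Yes'], size=600),
    'Highest_educational_level_parents': rng.choice(levels, size=600),
    'Overall_Literacy': rng.normal(470, 90, size=600),
})

# Count, mean and standard error of the mean for every combination.
summary = (
    demo.groupby(['Class_repeated', 'Highest_educational_level_parents'])['Overall_Literacy']
        .agg(['count', 'mean', 'sem'])
)
print(summary.round(1))
```

If two group means differ by less than a couple of standard errors, the apparent reversal may simply be sampling noise.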
In this plot, the variables investigated are shown below:
- Class Repeated (Categorical)
- Overall Literacy (Quantitative)
- Internet Connection (Categorical)
With 2 categorical variables and 1 quantitative variable, faceted box plots are used. The x-axis shows "Class Repeated", the y-axis shows "Overall Literacy", and each of the 3 facets corresponds to one option of the categorical variable "Internet Connection".
The plot shows that, regardless of whether students repeated a class, they have higher overall literacy if they had an internet connection and used it than if they had one but did not use it or had none at all.
A surprising observation here is that even students who repeated a grade had higher overall literacy when they had and used an internet connection than students, repeaters or not, who either had no internet connection or had one but did not use it.
# `height` replaces the older `size` argument, which newer seaborn versions no longer accept
g = sb.FacetGrid(data = df_pisa_final, col = 'Internet_Connection', height = 7, col_wrap=3)
g.map(sb.boxplot, 'Class_repeated', 'Overall_Literacy', color='darkred');
#plt.title("Effect of having an Internet Connection on Class Repeation and Overall Literacy")
g.axes[0].set_xlabel('Class Repeated')
g.axes[1].set_xlabel('Class Repeated')
g.axes[2].set_xlabel('Class Repeated')
g.axes[0].set_ylabel('Overall Literacy');
g.axes[0].set_title("Internet Connection: No");
g.axes[1].set_title("Internet Connection: Yes, but I don't use it");
g.axes[2].set_title("Internet Connection: Yes, and I use it");
In this plot, the variables investigated are shown below:
- Class Repeated (Categorical)
- Count of Students (Quantitative)
- Internet Connection (Categorical)
With 2 categorical variables and 1 count variable, a clustered count plot is used. The x-axis shows "Internet Connection", the y-axis shows "Count of Students", and the color encoding represents the "Repeat a Grade" variable.
More students repeated a grade among those who have and use an internet connection than among those who have an internet connection but do not use it or have none at all. This calls for further investigation of these variables. In the next plot (Plot 2B) the count on the y-axis is replaced by "Average Learning Time" to look for a trend or relationship worth discussing.
plt.figure(figsize = [9, 10])
ax = sb.countplot(data = df_pisa_final, x = 'Internet_Connection', hue = 'Class_repeated', palette='viridis')
ax.legend(loc = 2, ncol = 1, framealpha = 1, title = 'Repeat a Grade?')
plt.title("Effect of Internet Accessibility and Usage on Class Repetition and Count of Students")
plt.xlabel('Internet Connection')
plt.ylabel('Count of Student');
#plt.rcParams.update({'font.size': 17});
In this plot, the variables investigated are shown below:
- Class Repeated (Categorical)
- Average Learning Time in mins per Week (Quantitative)
- Internet Connection (Categorical)
With 2 categorical variables and 1 quantitative variable, a clustered bar plot is used. The x-axis shows "Internet Connection", the y-axis shows "Average Learning Time", and the color encoding represents the "Repeat a Grade" variable.
Students who had and used an internet connection tend to have the lowest average learning time, regardless of whether they repeated a grade, compared with students who have an internet connection but do not use it or have none at all. Students with no internet connection spent more learning time on average than students in either of the other two groups.
plt.figure(figsize = [9, 10])
ax = sb.barplot(data = df_pisa_final, x = 'Internet_Connection', y = 'Average_Learning_Time', hue = 'Class_repeated', palette='viridis')
ax.legend(loc = 9, ncol = 3, framealpha = 1, title = 'Repeat a Grade?')
ax.set_ylim([190,220]);
plt.title("Effect of Internet Accessibility and Usage on Class Repetition and Average Learning Time in mins per week")
plt.xlabel('Internet Connection')
plt.ylabel('Average Learning Time in mins (per Week)');
plt.rcParams.update({'font.size': 10});
[1] http://www.oecd.org/pisa/pisaproducts/MS12_StQ_FORM_UH_ENG.pdf
[2] http://www.oecd.org/pisa/pisaproducts/CBA12_cogn_codebook.pdf
[3] http://www.oecd.org/pisa/pisaproducts/CBA12_cogs_codebook.pdf
[4] https://stackoverflow.com/questions/38085547/random-sample-of-a-subset-of-a-dataframe-in-pandas
[5] https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
[6] https://stackoverflow.com/questions/34782063/how-to-use-pandas-filter-with-iqr
[7] https://gist.github.com/fomightez/bb5a9c727d93d1508187677b4d74d7c1
[8] https://stackoverflow.com/questions/3777861/setting-y-axis-limit-in-matplotlib
[9] https://seaborn.pydata.org/tutorial/color_palettes.html
[10] https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot
[11] https://cduvallet.github.io/posts/2018/11/facetgrid-ylabel-access
[12] https://www.oecd.org/pisa/sitedocument/PISA-2015-Technical-Report-Chapter-9-Scaling-PISA-Data.pdf
[13] https://www.oecd.org/pisa/pisaproducts/PISA12_stu_codebook.pdf
[14] http://www.oecd.org/pisa/pisaproducts/PISA%202012%20framework%20e-book_final.pdf
[15] http://www.oecd.org/pisa/aboutpisa/
[16] http://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf
[17] https://en.wikipedia.org/wiki/International_Standard_Classification_of_Education
[18] https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
[19] https://largescaleassessmentsineducation.springeropen.com/articles/10.1186/s40536-020-00086-x#Sec38
[24] https://thispointer.com/pandas-drop-rows-from-a-dataframe-with-missing-values-or-nan-in-columns/
[26] https://dev.to/thalesbruno/subplotting-with-matplotlib-and-seaborn-5ei8
[27] https://stackoverflow.com/questions/25239933/how-to-add-title-to-subplots-in-matplotlib
#df_pisa_final.to_csv('df_pisa_final.csv')
#df_trimmed.to_csv('df_trimmed.csv')